Let's scrape the IRE homepage

Our goal: Print out the headlines from the IRE home page.

requests is a handy third-party library for making HTTP requests. It does the same thing your browser does when you type in a URL and hit enter -- sends a message to a server and requests a copy of the page -- but it allows us to do this programatically instead of pointing and clicking. For our purposes today, we're interested in the library's get() method.

Import the libraries



In [ ]:

    
import requests
from bs4 import BeautifulSoup

Fetch and parse the HTML



In [ ]:

    
# use the `get()` method to fetch a copy of the IRE home page
ire_page = requests.get('http://ire.org')

# feed the text of the web page to a BeautifulSoup object
soup = BeautifulSoup(ire_page.text, 'html.parser')

Target the headlines

View source on the IRE homepage and find the headlines. What's the pattern?



In [ ]:

    
# get a list of headlines we're interested in
heds = soup.find_all('h1', {'class': 'title1'})

Loop over the heds, printing out the text

You can drill down into a nested tag using a period.



In [ ]:

    
for hed in heds:
    print(hed.a.string)

Exercise: Print the links

Your mission: Loop over the headlines and print the links (the href portion of the tag) for each one. You can access tag attributes like you'd access values in a dictionary. (This might require some Googling.)



In [ ]: